Abstract

Background: The University of California, Santa Cruz (UCSC) genome database is among the most used sources of 
genomic annotation in human and other organisms. The database offers an excellent web-based graphical user 
interface (the UCSC genome browser) and several means for programmatic queries. However, a simple application
programming interface (API) in a scripting language aimed at biologists was not yet available. Here, we
present the Ruby UCSC API, a library to access the UCSC genome database using Ruby.

Results: The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the 
object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports 
databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, 
insects, nematodes, and yeast. 

The API uses the bin index — if available — when querying for genomic intervals. The API also supports genomic 
sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is 
implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters 
(including JRuby). 

Conclusions: Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API
will make it easier for biologists to query the UCSC genome database programmatically. The API is available through the
RubyGems system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/
under the Ruby license. Feedback and help are provided via the website at http://rubyucscapi.userecho.com/.



Background 

The University of California, Santa Cruz (UCSC) genome database [1] is one of the most common
gateways to access genomic sequence and annotation data of humans and other organisms. Besides a
web-based genome browser [2], the database is programmatically accessible through three
interfaces: the official command-line tools and libraries [3,4], the Distributed Annotation
System (DAS) [5] server, and direct access to a public MySQL database server. UCSC's official
tools consist of command-line executables and API libraries written in the C language. The C API
covers most of the functionality of the database with good performance. These tools and libraries
are available in the Kent source tree [3,6]. The UCSC DAS server, which supports the older DAS
version 0.95, offers a simple interface for programmatic access to the database. However, it
supports only a limited range of annotation types and is comparatively slow. Finally, the public
MySQL server offers access to almost the same up-to-date database that backs the genome browser,
but requires the user to write raw SQL statements. Given the pervasive use of scripting languages
in this field of research, there is a significant demand for simple APIs that allow construction
of automated queries in these languages. In particular, the Ruby programming language has been
widely adopted in the bioinformatics domain [7,8]. Libraries including BioRuby [9] and the Ruby
Ensembl API [10] have shown the value of database APIs for Ruby.

Here, we describe the Ruby UCSC API, an API to
query the UCSC genome database.

Figure 1 Overview of the Ruby UCSC API. The UCSC genome
database provides a public MySQL server consisting of multiple 
databases, e.g. mm9, hg18, and hg19. Each database contains many 
tables. The Ruby UCSC API and the ActiveRecord framework 
automatically define classes and instance methods corresponding to 
tables and their fields, respectively. 



Implementation 

Object-relational mapping 

The Ruby UCSC API is based on the ActiveRecord 3 framework, a component of Ruby on Rails [11],
for the object-relational mapping (Figure 1). A database in the UCSC genome database is
represented as a module under the Bio::Ucsc namespace, and a table in the database is represented
as a class (a subclass of ActiveRecord::Base) under the database module. For example, the
"snp132" table in the human genome assembly "hg19" database is referred to as
Bio::Ucsc::Hg19::Snp132. Query APIs to a table are automatically defined from the database schema
as class methods following the ActiveRecord method naming convention. For example, if the
"snp132" table has a field (column) "name", the Snp132.find_by_name method is readily available.
Records (rows) are instances of the corresponding table class, from which the value of any field
can be obtained.
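
The naming rule described above can be illustrated with a short sketch. The helper name ucsc_class_path is hypothetical and is not part of the Ruby UCSC API; it only assumes that the first letter of the database and table names is capitalized while the rest of each name is kept unchanged.

```ruby
# Hypothetical helper (not part of the Ruby UCSC API): derive the constant
# path that the naming convention assigns to a database/table pair.
def ucsc_class_path(database, table)
  # Upcase only the first character; the rest of the name is kept verbatim.
  camel = ->(name) { name[0].upcase + name[1..-1] }
  "Bio::Ucsc::#{camel.call(database)}::#{camel.call(table)}"
end

puts ucsc_class_path("hg19", "snp132")
puts ucsc_class_path("hg19", "refGene")
```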

Dynamic class definition 

The UCSC database is optimized to serve the genome browser, resulting in a very large number of
tables (about 41,840 tables as MySQL *.MYD files) to which the API has to provide access.
Furthermore, these database components are updated frequently. Static definitions of so many
table classes would make API code maintenance difficult. Therefore, we employed dynamic class
definition in the Ruby UCSC API. When a table is referred to for the first time, the API fetches
the database schema of that table to determine the data types and then creates an appropriate
Ruby class for that table. This lazy generation of classes also speeds up initialization of the
API compared with defining static classes for thousands of tables.
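
The idea of creating a class only when it is first referenced can be sketched in plain Ruby with const_missing. This is an illustration of the general technique, not the API's actual code: the LazyTables module, its SCHEMA hash and the use of Struct are assumptions made for the example, and a real implementation would read the schema from the database server.

```ruby
# Illustrative only: a toy registry that creates a class the first time its
# constant is referenced, mimicking lazy table-class generation.
module LazyTables
  # Hypothetical stand-in for schema information fetched from the server.
  SCHEMA = {
    "Snp131"  => [:chrom, :chromStart, :chromEnd, :name],
    "RefGene" => [:name, :chrom, :strand, :name2]
  }.freeze

  def self.const_missing(const_name)
    fields = SCHEMA[const_name.to_s]
    return super unless fields      # unknown table: raise NameError as usual

    klass = Struct.new(*fields)     # one accessor per column
    const_set(const_name, klass)    # cache it, so const_missing is not called again
  end
end

row = LazyTables::Snp131.new("chr1", 100, 101, "rsExample")
puts row.name
puts LazyTables.constants.include?(:RefGene)  # RefGene has not been referenced yet
```

Because the constant is stored with const_set, the lookup cost is paid once per table rather than once per query.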

Supporting auxiliary flat files 

A subset of the UCSC genome database, including gen- 
ome sequences, is not stored in the MySQL database 
but needs to be downloaded locally for access. The Ruby 
UCSC API offers methods to access these downloaded 
genome sequences (*.2bit files). 
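
As a rough illustration of what reading such a file involves, the sketch below decodes the packed DNA part of the 2bit format, in which each base occupies two bits (T=00, C=01, A=10, G=11) and the first base sits in the most significant bits of a byte. The function name unpack_2bit is hypothetical, and the file header, the sequence index, and the N and soft-mask blocks are deliberately left out.

```ruby
# Sketch of the 2-bit packing: four bases per byte, T=00, C=01, A=10, G=11,
# first base in the most significant bits. Not the API's Twobit class.
BASES = %w[T C A G].freeze

def unpack_2bit(packed, n_bases)
  seq = String.new
  packed.each_byte do |byte|
    # Walk the four 2-bit fields from high to low.
    [6, 4, 2, 0].each { |shift| seq << BASES[(byte >> shift) & 0b11] }
  end
  seq[0, n_bases]   # the last byte may be padded
end

packed = [0b00101001, 0b01010010].pack("C*")
puts unpack_2bit(packed, 8)
```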

Dependencies and environment 

The Ruby UCSC API depends on ActiveRecord 3 and is 
designed as a BioRuby plugin using the Biogem system 
[12,13], which organizes RubyGems packages and their 
dependencies for the BioRuby library. 

The Ruby UCSC API is written purely in Ruby. This 
increases the compatibility of the API for various operat- 
ing systems and implementations of the Ruby inter- 
preter. The API currently supports different Ruby 
interpreters including Ruby version 1.9.2 or later, Ruby 
version 1.8.7 or later, and JRuby 1.6.3 or later. 

Results and discussion 

Features and usage 

Figure 2 shows examples of the Ruby UCSC API in use.

Database connection

After loading the Ruby UCSC API library (line 2), a connection to a database can be established
by the connect method (line 6). While the default connection is made to the UCSC public MySQL
server, alternative full or partial mirror servers can be used as well (lines 9-11). The API can
connect to multiple databases simultaneously.

Table query 

Users can query the database through a series of "find" class methods, which ActiveRecord
defines dynamically for each table class. The find_by_[field-name] and find_all_by_[field-name]
class methods retrieve the first or all matching records, respectively. An example query on the
"name" field is shown in line 13. Multiple conditions joined by _and_ in the method name are
also accepted (line 29). Following the ActiveRecord convention, the values of the other fields in
a retrieved record can be read through instance methods named after the fields (line 14).

 1: # load the library
 2: require 'bio-ucsc'
 3: # shorthand for omitting the "Bio::" prefix
 4: include Bio
 5: # connect to the UCSC's official MySQL server for the Hg19 database
 6: Ucsc::Hg19.connect
 7: # connect to a local mirror server for the Hg18 database
 8: # hostname, username, and password must be stated explicitly.
 9: Ucsc::Hg18.connect({:db_host => 'localhost',
10:                     :db_username => 'genome',
11:                     :db_password => ''})
12: # find the first record by the SNP ID and extract its chromosome number
13: result = Ucsc::Hg19::Snp131.find_by_name("rs56289060")
14: puts result.chrom # => "chr1"
15: # "with_interval" method searches features on the given region.
16: # ("with_interval_excl" methods exclude partly overlapping features)
17: region = "chr17:7,579,614-7,579,700"
18: puts Ucsc::Hg19::Snp131.with_interval(region).find(:all).size # => 9
19: puts Ucsc::Hg19::Snp131.with_interval_excl(region).find(:all).size # => 8
20: # given the combined search conditions, a SQL statement is generated on demand
21: # and the corresponding "bin" numbers are calculated automatically.
22: condition = Ucsc::Hg19::Snp131.with_interval(region).select(:name)
23: puts condition.to_sql # => SELECT name FROM `snp131`
24: #    WHERE (chrom = 'chr17' AND bin in (642,80,9,1,0)
25: #    AND ((chromStart BETWEEN 7579613 AND 7579700)
26: #    OR (chromEnd BETWEEN 7579613 AND 7579700)
27: #    OR (chromStart <= 7579613 AND chromEND >= 7579700)));
28: # apply multiple search conditions by the "find_xxx_and_yyy" method
29: puts condition.find_all_by_class_and_strand("in-del", "+").size # => 1
30: # Instances for genePred table records have #exons, #cdss, and #introns methods.
31: # These methods return an instance of Bio::GenomicInterval
32: hit = Ucsc::Hg19::RefGene.find_by_name2("UVSSA")
33: puts "#{hit.exons.size}, #{hit.cdses.size}" # => "14, 13"
34: puts hit.exons[0] # => chr4:1341104-1341548
35: puts hit.introns[0] # => chr4:1341549-1341877
36: puts hit.cdss[0] # => chr4:1341880-1341977
37: # automatic declaration of the table association using the all.joiner schema file
38: joiner = Bio::Ucsc::Schema::Joiner.load
39: joiner.variables["gbd"] = ["hg19"]
40: joiner.define_association(Ucsc::Hg19::Snp131)
41: puts Ucsc::Hg19::Snp131.find_by_name("rs242").snp131Seq.first.file_offset # => 1112
42: # manual declaration of the table association
43: Ucsc::Hg19::KnownGene.class_eval do
44:   has_one :knownToEnsembl, {:primary_key => :name, :foreign_key => :name}
45: end
46: puts Ucsc::Hg19::KnownGene.first.name # => "uc001aaa.3"
47: puts Ucsc::Hg19::KnownGene.first.knownToEnsembl.value # => "ENST00000456328"
48: # load a locally-stored sequence file, and extract partial sequence
49: seq = Ucsc::File::Twobit.open("hg19.2bit")
50: puts seq.subseq("chr1:9990-10009") # => "NNNNNNNNNNNTAACCCTAA"
Figure 2 Example script using Ruby UCSC API. 



Query by genomic intervals 

A genomic interval can be expressed by a string like "chr1:123,456-456,789", as used in the
graphical web interface of the UCSC Genome Browser (line 17). An interval query condition is
passed by the with_interval method (lines 18-19). This method automatically absorbs the
difference between the intuitive 1-based coordinates and the 0-based coordinate system used
internally by the database (compare line 17 with lines 25-27). The with_interval method
retrieves all features that overlap the given interval (line 18). In contrast, the
with_interval_excl method returns only features that lie completely within the region; features
that partially overlap the region are excluded (line 19).
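
The coordinate conversion and the two kinds of interval test can be sketched as follows. The helpers parse_interval, overlaps? and within? are hypothetical names introduced for illustration; they assume 0-based, half-open ranges on the database side and are not the API's own implementation.

```ruby
# Hypothetical helpers, not the library's implementation.
# Parse a browser-style position into a chromosome and a 0-based, half-open range.
def parse_interval(text)
  m = /\A(\w+):([\d,]+)-([\d,]+)\z/.match(text.delete(" "))
  raise ArgumentError, "cannot parse #{text.inspect}" unless m

  first = m[2].delete(",").to_i   # 1-based, inclusive
  last  = m[3].delete(",").to_i   # 1-based, inclusive
  [m[1], first - 1, last]         # 0-based start; the end value is unchanged
end

# A feature [f_start, f_end) overlaps the query [q_start, q_end) if they share a base.
def overlaps?(f_start, f_end, q_start, q_end)
  f_start < q_end && f_end > q_start
end

# A feature lies completely within the query region.
def within?(f_start, f_end, q_start, q_end)
  f_start >= q_start && f_end <= q_end
end

chrom, q_start, q_end = parse_interval("chr17:7,579,614-7,579,700")
puts "#{chrom} #{q_start} #{q_end}"
```

Only the start coordinate changes in the conversion: a 1-based inclusive end and a 0-based exclusive end are the same number.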

Bin indexing system 

To achieve high query performance for large tables, the UCSC database uses a bin indexing system
[6]. In this system, genomic positions in a chromosome are separated into hierarchies of bins
that are sized 512 Mbase, 64 Mbase, 8 Mbase, 1 Mbase and 128 kbase. Any annotation in a genomic
interval is stored in the smallest bin that encompasses the whole interval. For a genomic
interval query, if the target table has a "bin" field, the API automatically calculates a list of
bins that potentially contain annotations for the interval and uses the list in the generated SQL
statement to narrow down the target records. This is a key feature of the API because repeated
queries for genomic intervals without the bin index take excessive time, especially for large
tables such as dbSNP.
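
The bin calculation can be sketched as below. This is a minimal sketch that assumes the standard five-level scheme with offsets 585, 73, 9, 1 and 0 and a smallest bin of 2**17 bases; the function names bin_for and bins_overlapping are hypothetical and the code is not taken from the API. For the region used in Figure 2 it yields the same bin list that appears in the generated SQL statement there.

```ruby
# Sketch of the standard five-level binning scheme (128 kb, 1 Mb, 8 Mb, 64 Mb, 512 Mb).
BIN_OFFSETS     = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0].freeze
BIN_FIRST_SHIFT = 17   # 2**17 bases, the smallest bin
BIN_NEXT_SHIFT  = 3    # each level is 8 times larger

# Smallest bin that contains the whole 0-based, half-open range [start, stop).
def bin_for(start, stop)
  s = start >> BIN_FIRST_SHIFT
  e = (stop - 1) >> BIN_FIRST_SHIFT
  BIN_OFFSETS.each do |offset|
    return offset + s if s == e
    s >>= BIN_NEXT_SHIFT
    e >>= BIN_NEXT_SHIFT
  end
  raise RangeError, "interval does not fit the standard binning scheme"
end

# All bins, level by level, that may hold features overlapping [start, stop).
def bins_overlapping(start, stop)
  s = start >> BIN_FIRST_SHIFT
  e = (stop - 1) >> BIN_FIRST_SHIFT
  BIN_OFFSETS.flat_map do |offset|
    level = ((offset + s)..(offset + e)).to_a
    s >>= BIN_NEXT_SHIFT
    e >>= BIN_NEXT_SHIFT
    level
  end
end

p bins_overlapping(7_579_613, 7_579_700)   # chr17:7,579,614-7,579,700
```

A query has to check one or more bins at every level because a long feature that overlaps the region may have been stored in a larger bin.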

Building SQL statements 

Methods to specify search conditions, such as with_interval, select, where, order, limit and
group, can be combined by chaining (lines 22-27). When a find method or one of the methods to
access arrays (such as find_all, first and []) is called on the condition, the constructed SQL
query is executed and the results are returned (line 29).
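
The chaining behaviour can be imitated with a toy query builder. The ToyRelation class below is hypothetical and far simpler than ActiveRecord's relation objects; it only shows the pattern in which each condition method returns a new object and the SQL text is assembled when it is asked for.

```ruby
# Toy query builder (hypothetical; ActiveRecord's real relation objects are far richer).
class ToyRelation
  def initialize(table, parts = {})
    @table = table
    @parts = { select: "*", where: [], limit: nil }.merge(parts)
  end

  # Each condition method returns a new relation, so calls can be chained.
  def select(*fields)
    ToyRelation.new(@table, @parts.merge(select: fields.join(", ")))
  end

  def where(condition)
    ToyRelation.new(@table, @parts.merge(where: @parts[:where] + [condition]))
  end

  def limit(count)
    ToyRelation.new(@table, @parts.merge(limit: count))
  end

  # Nothing is sent anywhere here; the SQL text is only assembled on request.
  def to_sql
    sql = "SELECT #{@parts[:select]} FROM #{@table}"
    unless @parts[:where].empty?
      sql += " WHERE " + @parts[:where].map { |c| "(#{c})" }.join(" AND ")
    end
    sql += " LIMIT #{@parts[:limit]}" if @parts[:limit]
    sql
  end
end

base  = ToyRelation.new("snp131")
query = base.where("chrom = 'chr17'").where("bin IN (642, 80)").select(:name).limit(5)
puts query.to_sql
```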

Methods to access information of exons, CDSs, and introns 

Instances of "genePred" table classes, such as RefGene, EnsGene, and KnownGene, have exons,
cdss, and introns methods. These methods return arrays of Bio::GenomicInterval objects sorted
according to the gene strand (lines 32-36).
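
How such intervals can be derived from a genePred record is sketched below, assuming the usual genePred layout of comma-terminated lists of 0-based exon starts and exon ends plus a coding start and end. All helper names and the two-exon transcript are made up for illustration; the sketch handles the plus strand only and returns plain arrays rather than Bio::GenomicInterval objects.

```ruby
# Hypothetical helpers working on genePred-style fields.
def parse_list(text)
  text.split(",").map(&:to_i)     # the trailing comma leaves no empty element
end

def exon_ranges(exon_starts, exon_ends)
  parse_list(exon_starts).zip(parse_list(exon_ends))
end

# Introns are the gaps between consecutive exons.
def intron_ranges(exons)
  exons.each_cons(2).map { |(_, left_end), (right_start, _)| [left_end, right_start] }
end

# Coding segments are the parts of each exon inside [cds_start, cds_end).
def cds_ranges(exons, cds_start, cds_end)
  clipped = exons.map { |s, e| [[s, cds_start].max, [e, cds_end].min] }
  clipped.select { |s, e| s < e }
end

# Format a 0-based, half-open range as a 1-based browser position.
def to_position(chrom, range)
  "#{chrom}:#{range[0] + 1}-#{range[1]}"
end

# Made-up two-exon transcript on the plus strand.
exons = exon_ranges("1341103,1341877,", "1341548,1341977,")
puts to_position("chr4", exons[0])
puts to_position("chr4", intron_ranges(exons)[0])
puts to_position("chr4", cds_ranges(exons, 1_341_879, 1_341_977)[0])
```

For a gene on the minus strand, the resulting lists would additionally be reversed so that the first element is the first exon in transcription order.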

Table association 

The joiner schema file describes the links between the tables of the UCSC genome database. The
Bio::Ucsc::Schema::Joiner.load class method takes a URI of the schema file. If the URI is not
given, UCSC's all.joiner file [14] is used (line 38). The format of the joiner file is documented
in the Kent source tree [15]. Variables in the joiner file can be overwritten. For example,
overwriting the gbd variable, which lists all databases, restricts the databases used for the
link search (line 39). The define_association method takes a table class and defines all the
associations of the given table (line 40). Unconnected databases and undefined tables are ignored
during definition. Linked results are always returned as an array (line 41). The table
association can also be defined manually. When a record in a table can be joined with a record of
another table by sharing the same value (foreign key), the has_one / has_many methods are used to
declare the association (lines 43-45). Once the table association is declared, a table can refer
to the associated table using a method of its record object (lines 46-47).

Retrieval of genomic sequences

Extraction of genomic sequences in given genomic intervals is a frequent task. The UCSC genome
database does not store the genomic sequences in the MySQL databases. Instead, the sequences are
provided as *.2bit files. These files are usually processed by UCSC's tools written in C. To
improve compatibility, we implemented the same functionality in Ruby. With the
Bio::Ucsc::File::Twobit class, *.2bit files are interpreted in Ruby and subsequences can be
extracted by the subseq method (lines 49-50).

Current limitations

The current version of the Ruby UCSC API uses information from the joiner schema file to find
table associations. The all.joiner file, however, describes additional information, including
which tables are chromosome-based rather than genome-based, field values that have to be
transformed to define table associations, and tables with exceptional structures. In future
versions, the API will use this information to make user scripts simpler and to follow database
structure updates immediately. So far, manual definition of table associations still has a
performance advantage because it minimizes the number of association definitions, especially for
tables that have complicated associations.

For some tables, including subsets of the Encyclopedia of DNA Elements (ENCODE) [16], the actual
data are not stored in the MySQL database itself but are stored as references to BigWig, BigBed
[17] and BAM [18] files. BigWig and BigBed can be accessed by the UCSC tools in C. BAM files can
be processed by third-party tools such as Samtools [18] and Picard [19]. To date, the Ruby UCSC
API does not support these yet; however, users can use the bio-samtools BioRuby plugin [20] for
these tasks.

Table 1 Supported databases

Clade/organism     Databases
human              Hg19, Hg18
mammals            chimp (PanTro3), orangutan (PonAbe2), rhesus (RheMac2), marmoset (CalJac3),
                   mouse (Mm9), rat (Rn4), guinea pig (CavPor3), rabbit (OryCun2), cat (FelCat4),
                   panda (AilMel1), dog (CanFam2), horse (EquCab2), pig (SusScr2), sheep (OviAri1),
                   cow (BosTau4), elephant (LoxAfr3), opossum (MonDom5), platypus (OrnAna1)
vertebrates        chicken (GalGal3), zebra finch (TaeGut1), lizard (AnoCar2),
                   X. tropicalis (XenTro2), zebrafish (DanRer7), tetraodon (TetNig2), fugu (Fr2),
                   stickleback (GasAcu1), medaka (OryLat2), lamprey (PetMar1)
deuterostomes      lancelet (BraFlo1), sea squirt (Ci2), sea urchin (StrPur2)
insects            D.melanogaster (Dm3), D.simulans (DroSim1), D.sechellia (DroSec1),
                   D.yakuba (DroYak2), D.erecta (DroEre1), D.ananassae (DroAna2),
                   D.pseudoobscura (Dp3), D.persimilis (DroPer1), D.virilis (DroVir2),
                   D.mojavensis (DroMoj2), D.grimshawi (DroGri1), Anopheles mosquito (AnoGam1),
                   honey bee (ApiMel2)
nematodes          C.elegans (Ce6), C.brenneri (CaePb3), C.briggsae (Cb3), C.remanei (CaeRem3),
                   C.japonica (CaeJap1), P.pacificus (PriPac1)
others             sea hare (AplCal1), yeast (SacCer2)
common databases   Go, HgFixed, Proteome, UniProt, VisiGene

Existing UCSC APIs for scripting languages 

APIs for the UCSC genome database in scripting languages are still limited. For Perl, the
Genoman module [21] offers interfaces to databases including the UCSC genome database. For
Python, the Cruzdb library [22] offers an SQLAlchemy-based API for the UCSC genome database. The
biggest advantage of the Ruby UCSC API described here is that Ruby and the ActiveRecord framework
allow queries and retrieved records to be described concisely. Moreover, the Ruby UCSC API does
not depend on UCSC's command-line tools. This makes installation easier and increases
interoperability across environments, including the Java-based Ruby interpreter JRuby.

Conclusions 

UCSC's official executables and C libraries are the most comprehensive and fastest API for the
UCSC genome database; however, APIs for scripting languages still have significant advantages
for users, whose concern is not only runtime speed but also the total programming time required
to obtain the results. The Ruby UCSC API improves programming productivity and can therefore
have a significant impact in the field.

The Ruby UCSC API already supports all organisms in the UCSC genome database (Table 1). In future
releases, more comprehensive support for new organisms and older or updated genome assemblies
will be added.

The Ruby UCSC API is freely available as a RubyGems package. Source code and documentation are
also available at https://github.com/misshie/bioruby-ucsc-api/. Documentation and feedback are
available at the UserEcho site at http://rubyucscapi.userecho.com/.

Availability and requirements 
Project name: The Ruby UCSC API 
Project home page: https://github.com/misshie/bioruby-ucsc-api
Feedback and help: http://rubyucscapi.userecho.com/
Operating systems: Platform independent
Programming language: Ruby 

Other requirements: Ruby interpreter (Ruby 1.8.7 or 
later, Ruby 1.9.2 or later, or JRuby 1.6.3 or later), and 
ActiveRecord (version 3.0.7 or later). 
License: The Ruby License 

Any restrictions to use by non-academics: none 